Development of Bengali Named Entity Tagged Corpus and its Use in NER Systems
Abstract
The rapid development of language tools using machine learning techniques for less computerized languages requires an appropriately tagged corpus. A Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper. A web crawler retrieves the web pages in Hyper Text Markup Language (HTML) format from the news archive. At present, the corpus contains approximately 34 million wordforms. The date, location, reporter and agency tags present in the web pages have been named entity (NE) tagged automatically. A portion of this partially NE tagged corpus has been manually annotated with sixteen NE tags with the help of Sanchay Editor (sourceforge.net/project/nlp-sanchay), a text editor for Indian languages. This NE tagged corpus contains 150K wordforms. Additionally, 30K wordforms have been manually annotated with twelve NE tags as part of the IJCNLP08 NER Shared Task for South and South East Asian Languages (http://ltrc.iiit.ac.in/ner-ssea-08). A table-driven semi-automatic NE tag conversion routine has been developed to convert the sixteen-NE tagged corpus to the twelve-NE tagged corpus. The 150K NE tagged corpus has been used to develop Named Entity Recognition (NER) systems for Bengali using a pattern-directed shallow parsing approach, Hidden Markov Model (HMM), Maximum Entropy (ME) Model, Conditional Random Field (CRF) and Support Vector Machine (SVM). Experimental results of the 10-fold cross validation test have demonstrated that the SVM based NER system performs the best, with an overall F-Score of 91.8%.
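As a rough illustration of what a table-driven, semi-automatic tag conversion could look like, the Python sketch below maps tags through a lookup table and flags anything the table does not cover for manual review. The abstract does not list either tag set or describe the routine's internals, so the tag names, the `convert_tags` function, and the review mechanism are all hypothetical placeholders, not the authors' implementation.

```python
# Hypothetical conversion table: the abstract names neither the sixteen
# source tags nor the twelve target tags, so these entries are placeholders.
CONVERSION_TABLE = {
    "PERSON_NAME": "PERSON",
    "LOCATION_NAME": "LOCATION",
    "ORGANIZATION_NAME": "ORGANIZATION",
}


def convert_tags(tagged_tokens, table=CONVERSION_TABLE, outside="O"):
    """Convert (token, tag) pairs through a lookup table.

    Returns the converted pairs plus the indices of tokens whose tag has
    no table entry. Those are left unchanged for a human annotator, which
    is the assumed "semi-automatic" part of the routine.
    """
    converted = []
    needs_review = []
    for index, (token, tag) in enumerate(tagged_tokens):
        if tag == outside:
            # Tokens outside any named entity pass through untouched.
            converted.append((token, tag))
        elif tag in table:
            converted.append((token, table[tag]))
        else:
            converted.append((token, tag))
            needs_review.append(index)
    return converted, needs_review
```

Calling `convert_tags` on a list of `(token, tag)` pairs returns the converted list together with the positions that still need a manual decision, so the table can be extended incrementally as annotators resolve unmapped tags.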
Similar resources
PAYMA: A Tagged Corpus of Persian Named Entities
The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...
A'lam Corpus: A Standard Named Entity Corpus for the Persian Language
Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used for text summarization, data mining, data retrieval, question answering, machine translation, and document classification systems. An NER system is tasked with determining the border of each named entity, recognizing its type and classifying it into predefined categories. The categories of named...
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
Named Entity Recognition using Support Vector Machine: A Language Independent Approach
Named Entity Recognition (NER) aims to classify each word of a document into predefined target named entity classes and is now-a-days considered to be fundamental for many Natural Language Processing (NLP) tasks such as information retrieval, machine translation, information extraction, question answering systems and others. This paper reports about the development of a NER system for Bengali a...
Bengali Named Entity Recognition Using Support Vector Machine
Named Entity Recognition (NER) aims to classify each word of a document into predefined target named entity classes and is nowadays considered to be fundamental for many Natural Language Processing (NLP) tasks such as information retrieval, machine translation, information extraction, question answering systems and others. This paper reports about the development of a NER system for Bengali usi...